Proactive error prediction to improve storage system reliability
Abstract
This paper proposes the use of machine learning techniques to make storage systems more reliable in the face of sector errors. Sector errors are partial drive failures, where individual sectors on a drive become unavailable, and they occur at a high rate in both hard disk drives and solid state drives. The data in the affected sectors can only be recovered through redundancy in the system (e.g., another drive in the same RAID) and is lost if the error is encountered while the system operates in degraded mode, e.g., during RAID reconstruction. In this paper, we explore a range of machine learning techniques and show that sector errors can be predicted ahead of time with high accuracy. Prediction is robust even when only a small amount of training data is available, or when the only training data available comes from a different drive model. We also discuss a number of possible use cases for improving storage system reliability through the use of sector error predictors. We evaluate one such use case in detail: we show that the mean time to detect errors (and hence the window of vulnerability to data loss) can be greatly reduced by adapting the speed of a scrubber based on error predictions.
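The scrubber use case mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the function name `scrub_rate`, the linear scaling rule, and the parameters `base_rate`, `max_rate`, and `threshold` are all assumptions chosen only to show the general idea of scrubbing faster on drives that a predictor flags as likely to develop sector errors.

```python
def scrub_rate(predicted_error_prob, base_rate=1.0, max_rate=4.0, threshold=0.5):
    """Return a hypothetical scrub-rate multiplier for one drive.

    predicted_error_prob: predictor output in [0, 1] (assumed interface).
    base_rate: multiplier used while the drive looks healthy (assumption).
    max_rate: multiplier used when an error is predicted with certainty (assumption).
    threshold: probability above which scrubbing is accelerated (assumption).
    """
    if not 0.0 <= predicted_error_prob <= 1.0:
        raise ValueError("predicted_error_prob must lie in [0, 1]")
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must lie in [0, 1)")

    # Below the threshold, keep the normal background scrub speed.
    if predicted_error_prob < threshold:
        return base_rate

    # At or above the threshold, scale linearly from base_rate up to max_rate.
    fraction = (predicted_error_prob - threshold) / (1.0 - threshold)
    return base_rate + fraction * (max_rate - base_rate)


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.75, 1.0):
        print(p, scrub_rate(p))
```

Scrubbing a flagged drive faster shortens the time until a latent sector error is found, which is the quantity the abstract says is reduced. The trade-off is extra background I/O on that drive.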
Related articles
Improving Storage System Reliability with Proactive Error Prediction
This paper proposes the use of machine learning techniques to make storage systems more reliable in the face of sector errors. Sector errors are partial drive failures, where individual sectors on a drive become unavailable, and occur at a high rate in both hard disk drives and solid state drives. The data in the affected sectors can only be recovered through redundancy in the system (e.g. anot...
Analysis of Probabilistic Error Checking Procedures on Storage Systems
Conventionally, error checking on storage systems is performed on-the-fly (with probability 1) as the storage system is being accessed in order to improve the reliability of the storage system. However, such a procedure may needlessly cause degraded performance due to the extra processing time needed for executing the error checking code. In this paper, we consider fault-tolerant storage system...
A Proactive Fault Tolerance Scheme for Large Scale Storage Systems
Facing the increasingly high failure rate of drives in data centers, reactive fault tolerance mechanisms alone can hardly guarantee high reliability. Therefore, several hard drive failure prediction models that can identify soon-to-fail drives in advance have been proposed. However, few researchers have applied these models to distributed systems to improve reliability. This paper proposes SSM (Self-Scheduling...
Architecting Dependable Systems with Proactive Fault Management
Management of an ever-growing complexity of computing systems is an everlasting challenge for computer system engineers. We argue that we need to resort to predictive technologies in order to harness the system’s complexity and transform a vision of proactive system and failure management into reality. We describe proactive fault management, provide an overview and taxonomy for online failure p...
Prediction of fireball consequences caused by Boilover occurrence in the atmospheric storage tanks
Background and Objectives: Although boilover occurs infrequently, when it does occur it can cause severe damage to people and equipment around the tank. Predicting the fireball produced by a boilover plays an important role in adopting appropriate fire suppression strategies for atmospheric storage tanks. The purpose of this study is to predict the consequence...